a broad class of statistical models that extend linear and generalized linear models to handle data where observations are measured within discrete groups such as field sites; years or other temporal blocks; individuals that are observed multiple times; genotypes; species; etc. They can be thought of (equivalently) as (1) accounting for the correlation among observations from the same group; (2) estimating the variability among groups, or (3) parsimoniously estimating the effects of groups. They are most useful when the experimental or observational design includes a large number of groups with varying numbers of observations per group.
g (must be discrete!) and a varying term f(f | g) in most R MM packagesf varies across groups defined by g (f = 1 → intercept)f for each group g estimated by shrinkage (empirical Bayes, joint Bayesian prior, …)cf. Crawley (2002), Gelman (2005)
lme4; gamm4 to deal with spatial autocorrelationggplot2, maybe some tidyversebroom.mixed, DHARMa, car (using :: notation as appropriate)Nested: sub-unit IDs only measured within a single larger unit. e.g.: Plot1 in Block1 independent of Plot1 in Block2
Crossed: sub-unit IDs can be measured in multiple larger units. e.g. year, site
Unique coding: removes ambiguity
Robert Long, Cross Validated
n parameters (intercept + slope = 2)
n*(n+1)/2 parametersn independent effects (|| shortcut); only n parameters
knitr::include_graphics("pix/gelman_hill_complexity.png")
Gelman and Hill (2006)
knitr::include_graphics("pix/uriarte_yackulic_complexity.png")
Uriarte and Yackulic (2009)
Load packages up front, note what they’re used for …
library(tidyverse); theme_set(theme_bw())
library(lme4)
library(gamm4)
## diagnostics
library(DHARMa)
library(car) ## influencePlot
## extraction/graphics
library(broom.mixed)
library(dotwhisker)
library(gridExtra)
## from GitHub: see utils.R
library(r2glmm) ## bbolker/r2glmm
library(remef) ## hohenstein/remef
source("utils.R")
source("gamm4_utils.R")
dd <- readRDS("data/ecoreg.rds")
Single-level model (biomes), intercept variation only. All pairwise interactions of main variables ((...)^2), plus (log of) ecoregion area:
m1 <- lmer(mbirds_log ~ log(area_km2) + (Feat_log_sc + Feat_cv_sc + NPP_log_sc + NPP_cv_sc)^2 +
(1 | biome),
data = dd)
## may get
## Warning message:
## Some predictor variables are on very different scales: consider rescaling
Best practice: check diagnostics as early as possible (before summary(), coeff plots) to reduce snooping.
plot(m1, type = c("p", "smooth"))
## heteroscedasticity
plot(m1, sqrt(abs(resid(.))) ~ fitted(.), type = c("p", "smooth"))
car::influencePlot(m1)
plot(DHARMa::simulateResiduals(m1))
## basic coefficient plot
dotwhisker::dwplot(m1, effects="fixed") + geom_vline(xintercept = 0, lty = 2)
## ordered coefficient plot (from utils.R)
dwplot_ordered(m1, effects = "fixed")
update() is your friendm2 <- update(m1, . ~ . - (1|biome) + (1 + Feat_log_sc + Feat_cv_sc + NPP_log_sc + NPP_cv_sc | biome))
## compare plots
dwplot_ordered(list(intercept_only = m1, full = m2), effects = "fixed")
Now go to the (almost) maximal model; variation of main effects at all three geographic levels
## ~ 30 seconds
max_model <- lmer(mbirds_log ~ log(area_km2) + (Feat_log_sc + Feat_cv_sc + NPP_log_sc + NPP_cv_sc)^2 +
(Feat_log_sc + Feat_cv_sc + NPP_log_sc + NPP_cv_sc | biome) +
(Feat_log_sc + Feat_cv_sc + NPP_log_sc + NPP_cv_sc | flor_realms) +
(Feat_log_sc + Feat_cv_sc + NPP_log_sc + NPP_cv_sc | biome_FR),
data = dd,
## for speed/skip convergence warnings
control = lmerControl(calc.derivs = FALSE))
isSingular(max_model)
## [1] TRUE
lwr <- getME(max_model, "lower"); theta <- getME(max_model, "theta"); min(theta[lwr == 0])
## [1] 0
VarCorr(max_model)
## Groups Name Std.Dev. Corr
## biome_FR (Intercept) 0.108507
## Feat_log_sc 0.028953 -0.998
## Feat_cv_sc 0.058037 0.968 -0.975
## NPP_log_sc 0.112513 -0.011 -0.024 0.240
## NPP_cv_sc 0.036552 -0.987 0.991 -0.996 -0.148
## biome (Intercept) 0.126920
## Feat_log_sc 0.011041 -1.000
## Feat_cv_sc 0.050648 0.937 -0.937
## NPP_log_sc 0.148073 -0.467 0.467 -0.748
## NPP_cv_sc 0.091318 -0.619 0.619 -0.855 0.984
## flor_realms (Intercept) 0.229506
## Feat_log_sc 0.130446 0.561
## Feat_cv_sc 0.072527 0.396 0.442
## NPP_log_sc 0.118908 -0.360 -0.965 -0.514
## NPP_cv_sc 0.031674 0.246 -0.662 -0.095 0.792
## Residual 0.193242
allFit(), diagnose/evaluate differences in effects of interestfor loop over tablereformulate() is usefulall_vars <- "1 + Feat_log_sc + Feat_cv_sc + NPP_log_sc + NPP_cv_sc"
v1 <- expand.grid(c("1 | ", paste(all_vars, "|"), paste(all_vars, "||")),
c("biome", "flor_realms", "biome_FR"))
v2 <- sprintf("(%s)", apply(v1, 1, paste, collapse = " "))
## use cross_df instead of expand.grid, want chars
v3 <- cross_df(list(biome = v2[1:3], FR = v2[4:6], biome_FR = v2[7:9]))
v3[1,]
## # A tibble: 1 × 3
## biome FR biome_FR
## <chr> <chr> <chr>
## 1 (1 | biome) (1 | flor_realms) (1 | biome_FR)
model_list <- list()
p1 <- proc.time()
for (i in 1:nrow(v3)) {
cat(i, unlist(v3[i,]), "\n")
form <- reformulate(
c(sprintf("(%s)^2", all_vars), ## main effects and 2-way interax
"log(area_km2)", ## plus ecoregion area
unlist(v3[i,])), ## plus specified REs
response = "mbirds_log")
model_list[[i]] <- lmer(form, data = dd) ## fit model
}
saveRDS(model_list, file = "data/model_list.rds")
proc.time() - p1
model_list <- readRDS("data/model_list.rds")
aic_vec <- sapply(model_list, AIC)
is_sing <- sapply(model_list, isSingular)
conv_warn <- sapply(model_list, has_warning)
tibble(model=1:27, aic_vec, is_sing, conv_warn) %>% arrange(aic_vec)
## # A tibble: 27 × 4
## model aic_vec is_sing conv_warn
## <int> <dbl> <lgl> <lgl>
## 1 19 35.7 FALSE FALSE
## 2 16 38.6 TRUE FALSE
## 3 25 39.7 TRUE FALSE
## 4 10 40.9 TRUE FALSE
## 5 21 43.2 TRUE FALSE
## 6 27 46.2 TRUE FALSE
## 7 18 46.8 TRUE FALSE
## 8 12 48.4 TRUE FALSE
## 9 9 49.5 FALSE FALSE
## 10 13 51.2 FALSE TRUE
## # … with 17 more rows
best_index <- which(aic_vec == min(aic_vec) & !is_sing & !conv_warn)
best_model <- model_list[[best_index]]
best_model
## Linear mixed model fit by REML ['lmerMod']
## Formula:
## mbirds_log ~ (1 + Feat_log_sc + Feat_cv_sc + NPP_log_sc + NPP_cv_sc)^2 +
## log(area_km2) + (1 | biome) + (1 | flor_realms) + ((1 | biome_FR) +
## (0 + Feat_log_sc | biome_FR) + (0 + Feat_cv_sc | biome_FR) +
## (0 + NPP_log_sc | biome_FR) + (0 + NPP_cv_sc | biome_FR))
## Data: dd
## REML criterion at convergence: -4.3321
## Random effects:
## Groups Name Std.Dev.
## biome_FR NPP_cv_sc 0.07758
## biome_FR.1 NPP_log_sc 0.13922
## biome_FR.2 Feat_cv_sc 0.08691
## biome_FR.3 Feat_log_sc 0.05442
## biome_FR.4 (Intercept) 0.13560
## biome (Intercept) 0.10908
## flor_realms (Intercept) 0.23103
## Residual 0.19400
## Number of obs: 620, groups: biome_FR, 55; biome, 14; flor_realms, 6
## Fixed Effects:
## (Intercept) Feat_log_sc Feat_cv_sc
## 5.714624 0.091608 -0.006491
## NPP_log_sc NPP_cv_sc log(area_km2)
## 0.260378 0.066486 -0.029917
## Feat_log_sc:Feat_cv_sc Feat_log_sc:NPP_log_sc Feat_log_sc:NPP_cv_sc
## -0.011307 -0.051931 -0.047096
## Feat_cv_sc:NPP_log_sc Feat_cv_sc:NPP_cv_sc NPP_log_sc:NPP_cv_sc
## 0.013720 -0.036622 0.006426
plot(best_model, type = c("p", "smooth"))
## heteroscedasticity
plot(best_model, sqrt(abs(resid(.))) ~ fitted(.), type = c("p", "smooth"))
car::influencePlot(best_model)
plot(sr <- simulateResiduals(best_model))
Computes residuals at population level (usually a good default but not necessarily appropriate here?)
DHARMa::plotResiduals(sr, dd$NPP_log_sc)
pop_resids <- model.response(model.frame(best_model)) - predict(best_model, re.form=NA)
p1 <- lattice::xyplot(pop_resids ~ dd$NPP_log_sc, cex = 1, type = c("p", "smooth"),
main = "pop resids (DHARMa)")
p2 <- plot(best_model, resid(.) ~ NPP_log_sc, type=c("p", "smooth"),
main = "unit resids (lme4)")
gridExtra::grid.arrange(p1, p2, nrow = 1)
Easiest to explore spatial correlations graphically:
dd$res1 <- residuals(best_model)
ggplot(dd, aes(x, y, colour = res1, size = abs(res1))) +
geom_point() +
scale_colour_gradient2() +
scale_size(range=c(2,7))
gamm4Quickest (not necessarily best) fix for spatial autocorr; bs="sos" is a spherical smooth
gamm4_form <- update(formula(best_model), . ~ . + s(y, x, bs="sos"))
best_gamm4 <- gamm4(formula = nobars(gamm4_form),
random = as.formula(reOnly(gamm4_form)),
data = dd)
class(best_gamm4) <- c("gamm4", "list") ## so we can tidy etc. (using gamm4_utils.R)
dwplot_ordered(list(best_gamm4, best_model), effects="fixed")
dotwhisker::dwplot() as shown before, or broom.mixed::tidy() + ggplot tweaking
rsqvals <- r2glmm::r2beta(best_model, method = "sgv", partial = TRUE)
## set (reverse)factor order (or forcats::fct_inorder())
rsqvals$Effect <- factor(rsqvals$Effect, levels = rev(rsqvals$Effect))
ggplot(rsqvals, aes(x=Rsq, xmin = lower.CL, xmax = upper.CL, y = Effect)) +
geom_pointrange()
NPPvec <- with(dd, seq(min(NPP_log_sc), max(NPP_log_sc), length.out = 51))
pred <- emmeans::emmeans(best_model, ~NPP_log_sc, at = list(NPP_log_sc = NPPvec))
## NOTE: Results may be misleading due to involvement in interactions
gg0 <- ggplot(as.data.frame(pred),
aes(NPP_log_sc, emmean)) + geom_line() +
geom_ribbon(aes(ymin=lower.CL, ymax = upper.CL), alpha = 0.2, colour = NA)
gg1 <- gg0 + geom_point(data = dd, aes(y = mbirds_log, colour = biome))
print(gg1)
Can also back-transform, unscale, etc. (see e.g. Stack Overflow Q1, Stack Overflow Q2)
Compute partial residuals and display:
dd$remef <- remef::remef(best_model, fix = "NPP_log_sc")
grid.arrange(gg1,
gg0 + geom_point(data = dd, aes(y = remef, colour = biome)) +
scale_y_continuous(limits = range(dd$mbirds_log)),
nrow=1)
dotplot, qqmath methodsas.data.frame(ranef()) or broom.mixed::tidy(model, effects = "ran_vals")grid.arrange(grobs = lattice::dotplot(ranef(best_model)), nrow=1)
glmerglmmTMB, brmsGLMMadaptive: or go Bayesian)blme, MCMCglmm, rstanarm, brms …beyond intercept-only and diagonal structures
(1|g/f) instead of (f|g) for factor termscs() in glmmTMB, other possibilities in nlme::lme, MCMCglmm, brms, …mgcv: Markov random field, etc.brms: can use all mgcv smoothsINLA: probably best for spatial/temporal, but a whole other worldglmmTMB: simple space/time stuff with complex GLMMsBarr, Dale J., Roger Levy, Christoph Scheepers, and Harry J. Tily. 2013. “Random Effects Structure for Confirmatory Hypothesis Testing: Keep It Maximal.” Journal of Memory and Language 68 (3): 255–78. https://doi.org/10.1016/j.jml.2012.11.001.
Bates, Douglas, Reinhold Kliegl, Shravan Vasishth, and Harald Baayen. 2015. “Parsimonious Mixed Models.” arXiv:1506.04967 [Stat], June. http://arxiv.org/abs/1506.04967.
Bolker, Benjamin M. 2015. “Linear and Generalized Linear Mixed Models.” In Ecological Statistics: Contemporary Theory and Application, edited by Gordon A. Fox, Simoneta Negrete-Yankelevich, and Vinicio J. Sosa. Oxford University Press.
Bolker, Benjamin M., Mollie E. Brooks, Connie J. Clark, Shane W. Geange, John R. Poulsen, M. Henry H. Stevens, and Jada-Simone S. White. 2009. “Generalized Linear Mixed Models: A Practical Guide for Ecology and Evolution.” Trends in Ecology & Evolution 24: 127–35. https://doi.org/10.1016/j.tree.2008.10.008.
Crawley, Michael J. 2002. Statistical Computing: An Introduction to Data Analysis Using S-PLUS. John Wiley & Sons.
Gelman, Andrew. 2005. “Analysis of Variance: Why It Is More Important Than Ever.” Annals of Statistics 33 (1): 1–53. https://doi.org/doi:10.1214/009053604000001048.
Gelman, Andrew, and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge, England: Cambridge University Press. http://www.stat.columbia.edu/~gelman/arm/.
Matuschek, Hannes, Reinhold Kliegl, Shravan Vasishth, Harald Baayen, and Douglas Bates. 2017. “Balancing Type I Error and Power in Linear Mixed Models.” Journal of Memory and Language 94: 305–15. https://doi.org/10.1016/j.jml.2017.01.001.
McElreath, Richard. 2015. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Boca Raton: Chapman; Hall/CRC.
Olson, D. M., E. Dinerstein, E. D. Wikramanayake, N. D. Burgess, G. V. Powell, E. C. Underwood, J. A. D’amico, et al. 2001. “Terrestrial Ecoregions of the World: A New Map of Life on Earth: A New Global Map of Terrestrial Ecoregions Provides an Innovative Tool for Conserving Biodiversity.” BioScience 51 (11): 933–38.
Schielzeth, Holger. 2010. “Simple Means to Improve the Interpretability of Regression Coefficients.” Methods in Ecology and Evolution 1: 103–13. https://doi.org/10.1111/j.2041-210X.2010.00012.x.
Schielzeth, Holger, and Wolfgang Forstmeier. 2009. “Conclusions Beyond Support: Overconfident Estimates in Mixed Models.” Behavioral Ecology 20 (2): 416–20. https://doi.org/10.1093/beheco/arn145.
Uriarte, María, and Charles B. Yackulic. 2009. “Preaching to the Unconverted.” Ecological Applications 19 (3): 592–96. http://www.jstor.org/stable/27646001.
Wood, Simon. 2017. Generalized Additive Models: An Introduction with R. 2d ed. CRC Texts in Statistical Science. Chapman & Hall.
Zuur, Alain F., Elena N. Ieno, Neil J. Walker, Anatoly A. Saveliev, and Graham M. Smith. 2009. Mixed Effects Models and Extensions in Ecology with R. Springer.